Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add startupProbe result handling to kuberuntime #84279

Merged
merged 1 commit into from Nov 13, 2019

Conversation

matthyx
Copy link
Contributor

@matthyx matthyx commented Oct 24, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
StartupProbe failure of a container should (by design) trigger a restart of the Pod, as it's the case for livenessProbes.

Which issue(s) this PR fixes:
Fixes #84178

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2019
@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

/sig node
/cc @odinuge
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 24, 2019
@matthyx matthyx force-pushed the kuberuntime-startupprobe branch 3 times, most recently from d51c86c to ddd5eec Compare October 24, 2019 07:18
@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

/this-is-unbearable

@k8s-ci-robot
Copy link
Contributor

@matthyx: dog image

In response to this:

/this-is-unbearable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@odinuge
Copy link
Member

odinuge commented Oct 24, 2019

Thanks!

From the way I read the docs, the KEP and the changes in this PR, I think the comments in the tests should be updated. Isn't the correct behavior to restart the container when the startupProbe fails (not depending on the result of the liveness probe)?

/*
Release : v1.16
Testname: Pod liveness probe, using local file, delayed by startup probe
Description: A Pod is created with liveness probe that uses ‘exec’ command to cat the non-existent /tmp/health file. Liveness probe MUST fail after startup probe expires. The Pod MUST now be killed and restarted incrementing restart count to 1.
*/
framework.ConformanceIt("should be restarted with a exec \"cat /tmp/health\" because startup probe does not delay it long enough [NodeConformance]", func() {


More detailed comment here: #84178 (comment)

klog.V(1).Infof("SyncLoop (container unhealthy - liveness): %q", format.Pod(pod))
handler.HandlePodSyncs([]*v1.Pod{pod})
}
case updateStartup := <-kl.startupManager.Updates():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will "race" (not in the sense of a memory race, but the fact that only one of them will get updates to the channel) against this update mechanism, and ignore Success results:

func (m *manager) updateStartup() {
update := <-m.startupManager.Updates()
started := update.Result == results.Success
m.statusManager.SetContainerStartup(update.PodUID, update.ContainerID, started)
}

Not sure if that is intended or not..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the channel isn't shared... one listens from startupManager.Updates and the other from livenessManager.Updates. (maybe I don't get the point)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Looks like the fact that they don't share the channel is true, since it is created two startupManagers (in this PR).

https://github.com/matthyx/kubernetes/blob/323f99ea8c4faf44090b7554fbfaad2d25120b65/pkg/kubelet/prober/prober_manager.go#L111-L112

Would the startupManager created in pkg/kubelet/kubelet.go ever run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope so... but I have never really worked on that code...

@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

Isn't the correct behavior to restart the container when the startupProbe fails (not depending on the result of the liveness probe)?

Normally in that test we should have startupProbe succeed very quickly, to make sure livenessProbe executes and fails - causing the restart.

@odinuge
Copy link
Member

odinuge commented Oct 24, 2019

Normally in that test we should have startupProbe succeed very quickly, to make sure livenessProbe executes and fails - causing the restart.

The command is cmd := []string{"/bin/sh", "-c", "sleep 600"}, and the startupProbe checks for the file /tmp/health, and will never succeed. From what I understand from the KEP the container should restart when the Probe fails for the nth time, no matter what the result of the livenessProbe is. However the comment Liveness probe MUST fail after startup probe expires makes me think that if the livenessProbe succeeds, the container should not restart.

So, should a pod like this restart or just start normally:

cmd := []string{"sleep", "inf"}
livenessProbe := &v1.Probe{
  Handler: v1.Handler{
    Exec: &v1.ExecAction{
      Command: []string{"true"},
    },
  },
}
readinessProbe := &v1.Probe{
  Handler: v1.Handler{
    Exec: &v1.ExecAction{
      Command: []string{"true"},
    },
  },
}
startupProbe := &v1.Probe{
  Handler: v1.Handler{
   Exec: &v1.ExecAction{
      Command: []string{"false"},
    },
  },
}

@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

should a pod like this restart or just start normally

restart and crash loop backoff

@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

Isn't the correct behavior to restart the container when the startupProbe fails

Actually the difference between the two tests is the FailureThreshold of startupProbe... in the first case it is very big (60) to make sure we actually inhibit livenessProbe during the short lifespan of the test.
The second case has a bug, because I would expect startupProbe to succeed, allowing livenessProbe to run and fail, killing the container.
I think I should use /bin/true and /bin/false to make that more clear and correct.

@matthyx
Copy link
Contributor Author

matthyx commented Oct 24, 2019

@odinuge thanks for the comments on e2e tests... here is a PR to fix the ambiguities:
#84291

Copy link
Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/uncc

Deferring to @odinuge, who is already doing a great job with the review :)

@matthyx
Copy link
Contributor Author

matthyx commented Oct 28, 2019

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-kubemark-e2e-gce-big

@matthyx
Copy link
Contributor Author

matthyx commented Oct 31, 2019

Squashed and ready for review! Thanks a lot for all your help @odinuge !
/assign @Random-Liu

Copy link
Member

@odinuge odinuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @matthyx! Hopefully we can get this into beta in v1.17!

/lgtm

Failure Result = 1

// Unknown is encoded as -1 (type Result)
Unknown Result = -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill defer the naming of this to the approvers (eg. Pending)

Failure Result = 1

// Unknown is encoded as -1 (type Result)
Unknown Result = -1
)

func (r Result) String() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the new Result type here (and in ToPrometheusType) as well? I think the current behavior is ok, reporting -1/"UNKNOWN", but it may be nice to be explicit. Ill defer that to the approvers 😄

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2019
@matthyx
Copy link
Contributor Author

matthyx commented Nov 5, 2019

/assign @tallclair

@matthyx
Copy link
Contributor Author

matthyx commented Nov 12, 2019

@Random-Liu, @dchen1107, @derekwaynecarr, @tallclair, @vishh, @yujuhong this PR is really needed for 1.17, can you please make sure we don't miss code-freeze?

@@ -732,6 +733,7 @@ func TestComputePodActions(t *testing.T) {
mutatePodFn func(*v1.Pod)
mutateStatusFn func(*kubecontainer.PodStatus)
actions podActions
resetStatusFn func(*kubecontainer.PodStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we createTestRuntimeManager inside the loop? So that we don't need to reset the status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense... maybe it could go into another PR?

// Success is encoded as "true" (type Result)
Success Result = true
// Success is encoded as 0 (type Result)
Success Result = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The startup probe logic is almost the same with liveness probe, right?

Why Success/Failure is sufficient for liveness probe, but not startup probe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Because the init values of liveness probe and startup probes are different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that Unknown is newly introduced, and only used for initializing startup probe, I think this should be safe.

@dchen1107
Copy link
Member

/lgtm with one small concern. Asked @yujuhong take a quick look. If she is fine with it, I will approve it.

@@ -21,7 +21,7 @@ import (
"testing"
"time"

"k8s.io/api/core/v1"
v1 "k8s.io/api/core/v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove "v1"

Failure Result = 1

// Unknown is encoded as -1 (type Result)
Unknown Result = -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just use iota?

const (
  Unknown Result = iota -1
  Success
  Failure
)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2019
Copy link
Member

@odinuge odinuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2019
@matthyx
Copy link
Contributor Author

matthyx commented Nov 13, 2019

@dchen1107 I think you can approve when you're back!

@dchen1107
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, matthyx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2019
@k8s-ci-robot k8s-ci-robot merged commit a08b09d into kubernetes:master Nov 13, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 13, 2019
@matthyx
Copy link
Contributor Author

matthyx commented Nov 13, 2019

Thanks @dchen1107

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod with container with failed StartupProbe stays in Ready: false.
8 participants